Intro to Spatial Analysis

In this notebook, we will be going through some basic spatial analyses. The goal is to provide you with an intuition of the logic behind each of these functions. Try not to get bogged down in understanding every single line of the code, but focus more on the overall reasoning behind what is being done. The code in this notebook is not optimized for large datasets, but simplified versions of more complex functions, to make it more clear what is actually being done.

If you would like to run similar analyses on your own multiplexed imaging datasets, please see our lab’s pipeline here. There are also other toolkits for spatial analysis, including Squidpy and MCMICRO.

import os
import numpy as np
import pandas as pd
import skimage.io as io
from skimage.segmentation import find_boundaries
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib import colors
from mpl_toolkits.axes_grid1 import make_axes_locatable
from scipy.spatial.distance import cdist
from scipy import stats
import random
from sklearn.cluster import KMeans
import seaborn as sns

1. Inspect your data

Mantis Viewer or napari can be useful for visualizing your data, but it’s always a good idea to be able to open, view, and manipulate your data in Python, as it will give you more flexibility when analyzing your data.

%matplotlib notebook

# Directory where the data lives
data_dir = "example_data"

# Example image
ex_fov = "fov1"

# Look at a few markers
markers = ["CD45","Collagen1"]

# Set-up plots
plt.rcParams['figure.figsize'] = [10, 5]
fig, ax = plt.subplots(1,len(markers), sharex=True, sharey=True)
for i,mark in enumerate(markers):
    im_array = np.array(io.imread(os.path.join(data_dir, ex_fov, "image_data", mark+".tiff")))
    ax[i].imshow(im_array, origin="lower", cmap='gray', vmax=np.quantile(im_array,0.99))
    ax[i].set_title(mark)
    ax[i].axis('off')
plt.tight_layout()

We have already segmented these images to identify the location of single cells in the image using Mesmer. If you are interested in applying Mesmer to your own data, you can see the notebook here.

We can inspect the output of Mesmer here. In the segmentation output, each cell has a unique label. For example, all pixels with the value of 1 belong to the same cell, all pixels with the value of 2 belong to another cell, etc.

# Read in segmentation mask
seg_path = os.path.join(data_dir, ex_fov, "masks", "segmentation_whole_cell.tiff")
seg_array = np.array(io.imread(seg_path)).squeeze() #squeeze changes dimensions from (1,2048,2048) to (2048,2048)

# Set up plot
fig, ax = plt.subplots(figsize=[5,5])
plt.imshow(seg_array)
plt.axis('off')
plt.tight_layout()

Once we have segmented our data, we can generate cell tables where each row corresponds to one cell that was identified in the image. We have already identified the phenotype of each cell using Pixie. If you are interested in running Pixie on your own data, see the notebook here.

The ‘label’ column in the table corresponds to the cell IDs (pixel values in the segmentation output), ‘centroid-0’ and ‘centroid-1’ correspond to the center point of each cell, and the ‘cell_cluster’ column is the cell phenotype that we determined using Pixie.

# Read in cell table
cell_table_path = os.path.join(data_dir, "cell_table.csv")
cell_table = pd.read_csv(cell_table_path)

# Subset for only the example fov we're looking at here
fov_cell_table = cell_table.loc[cell_table['fov'] == ex_fov]
fov_cell_table
fov label centroid-0 centroid-1 cell_cluster
0 fov1 1 2 759 Cancer_Other
1 fov1 2 3 22 Immune_Other
2 fov1 3 3 114 CD4T
3 fov1 4 4 436 CD4T
4 fov1 5 4 164 NK
... ... ... ... ... ...
2777 fov1 2778 2044 114 Monocyte
2778 fov1 2779 2043 282 Other
2779 fov1 2780 2045 187 Cancer_Other
2780 fov1 2781 2044 1488 Treg
2781 fov1 2782 2044 1905 Fibroblast

2782 rows × 5 columns

We can then create a cell phenotype map, where each cell is colored according to its cell phenotype.

# Define colors we want for each cell type
all_colors = {}
all_colors['APC'] = '#4E79A7'
all_colors['B'] = '#A0CBE8'
all_colors['Cancer'] = '#F28E2B'
all_colors['Cancer_EMT'] = '#FFBE7D'
all_colors['Cancer_Other'] = '#59A14F'
all_colors['CD4T'] = '#8CD17D'
all_colors['CD8T'] = '#B6992D'
all_colors['Endothelium'] = '#F1CE63'
all_colors['Fibroblast'] = '#499894'
all_colors['Immune_Other'] = '#86BCB6'
all_colors['M1_Mac'] = '#E15759'
all_colors['M2_Mac'] = '#FF9D9A'
all_colors['Mac_Other'] = '#79706E'
all_colors['Mast'] = '#D4A6C8'
all_colors['Monocyte'] = '#D37295'
all_colors['Neutrophil'] = '#FABFD2'
all_colors['NK'] = '#B07AA1'
all_colors['Other'] = '#BAB0AC'
all_colors['Stroma'] = '#9D7660'
all_colors['T_Other'] = '#D7B5A6'
all_colors['Treg'] = '#FFFF99'

# Create table matching each color to a unique ID
colors_list = [(key,value) for key,value in all_colors.items()]
all_colors_df = pd.DataFrame(colors_list, columns=['cell_cluster','color'])
all_colors_df['pheno_id'] = all_colors_df.index + 1

# Make color map for plotting
mycols = all_colors_df['color'].tolist()
mycols.insert(0,'#000000') # add black for empty slide, will have id 0
mycols.append('#FFFFFF') # add white for cell borders, will have id max_n+1
colmap = colors.ListedColormap(mycols)
max_n = np.max(all_colors_df['pheno_id'])
bounds = [i-0.5 for i in np.linspace(0,max_n+2, max_n+3)]
norm = colors.BoundaryNorm(bounds, colmap.N)

# Define function for making cell phenotype map (CPM)
def create_cpm(fov_name, cell_table, all_colors_df, seg_array, fig, ax):
    # Subset cell table for this FOV
    one_fov_cell_table = cell_table.loc[cell_table['fov'] == fov_name]
    # Combine with cell table
    one_fov_cell_table = one_fov_cell_table.merge(all_colors_df, how='left')
    # Make dictionary mapping each cell to its phenotype id
    fov_cell_dict = dict(zip(one_fov_cell_table['label'], one_fov_cell_table['pheno_id']))
    # Add 0 for empty slide
    fov_cell_dict[0] = 0
    
    # Make new image where each pxiel corresponds to its phenotype id
    # Use 'vectorize' in numpy package to speed up this operation
    cpm_array = np.vectorize(fov_cell_dict.get)(seg_array)

    predicted_contour_mask = find_boundaries(seg_array, connectivity=1, mode='inner').astype(np.uint8)
    cpm_array[predicted_contour_mask > 0] = max_n+1

    # Plot
    cpm_image = colmap(norm(cpm_array))
    ax.imshow(cpm_image)
    
    return
# Create CPM for example FOV
fig, ax = plt.subplots(figsize=[8,8])
cpm = create_cpm(ex_fov, cell_table, all_colors_df, seg_array, fig, ax)
plt.axis('off')

# Add colorbar to image
divider = make_axes_locatable(fig.gca())
cax = divider.append_axes(position="right", size="5%", pad="3%")
cbar = fig.colorbar(cm.ScalarMappable(norm=norm, cmap=colmap),
                    cax=cax, orientation="vertical", use_gridspec=True, pad=0.1,
                    shrink=0.9, drawedges=True)
cbar_labels = all_colors_df['cell_cluster'].to_list()
cbar_labels.insert(0,'Empty') # add black for empty slide, will have id 0
cbar_labels.append('Cell border') # add white for cell borders, will have id max_n+1

cbar.ax.set_yticks(
    ticks=np.arange(len(cbar_labels)),
    labels=cbar_labels
)
cbar.minorticks_off()

plt.tight_layout()

2. Quanitfying cell populations

There are many methods for cell enumeration, including counting the number of cells of each cell type, calculating cell frequency by dividing by the total number of cells, and calculating cell density by dividing by the total tissue area. While the first approach is the most straightforward, simply counting the number of cells depends on the amount of tissue present in the image. Cell frequencies normalize the number of each cell type to the total number of cells in the image such that all cell types sum to 1. As a result, cell frequencies are not confounded by differences in the size of the section. However, one drawback of cell frequencies is that they can obscure the reason that the amount of one cell type differs between samples. For example, consider a simple example of comparing the number of tissue resident macrophages in healthy and inflamed tissue. Even if the number of macrophages is the same in both images, it may seem like tissue resident macrophages are decreasing in the latter relative to the former. In reality, however, the absolute count of tissue resident macrophages is not changing; they are just outnumbered by infiltrating immune cells. Only considering cell frequencies in such a scenario would lead to the incorrect conclusion that macrophages are decreasing in the inflamed state. One solution is to also examine the density of each cell population by dividing by the total tissue area.

Here, we are showing an example of two FOVs that have different amounts of immune infiltrate, ECM content, and total area (as determined using the slide background mask). We have generated masks of the ECM (using the composite signal of ECM markers) and empty sllide (for MIBI, we can determine empty slide by measuring gold signal, since we use gold sputtered slides). Using the empty slide mask, we can estimate tissue area (by taking the inverse). In the ECM and empty slide masks shown here, white indicates ECM or empty slide, respectively.

# Example images
fov1 = "fov1"
fov2 = "fov2"

# Set-up plots
plt.rcParams['figure.figsize'] = [10,6]
fig, ax = plt.subplots(2,3)

# Look at cell phenotype maps, ECM mask, and gold mask
for i,fov in enumerate([fov1,fov2]):
    # CPMs
    seg_path = os.path.join(data_dir,fov,"masks","segmentation_whole_cell.tiff")
    seg_array = np.array(io.imread(seg_path)).squeeze()
    create_cpm(fov, cell_table, all_colors_df, seg_array, fig, ax[i,0])
    ax[i,0].set_title("Cell phenotype map")
    ax[i,0].set_ylabel(fov)
    ax[i,0].set_yticklabels([])
    ax[i,0].set_xticklabels([])
    ax[i,0].set_yticklabels([])
    ax[i,0].set_xticks([])
    ax[i,0].set_yticks([])
    
    # ECM
    ecm_path = os.path.join(data_dir,fov,"masks","total_ecm.tiff")
    ecm_array = np.array(io.imread(ecm_path))
    ax[i,1].imshow(ecm_array, cmap='gray')
    ax[i,1].axis('off')
    ax[i,1].set_title("ECM")
    
    # Empty slide
    empty_slide_path = os.path.join(data_dir,fov,"masks","empty_slide.tiff")
    empty_slide_array = np.array(io.imread(empty_slide_path))
    ax[i,2].imshow(empty_slide_array, cmap='gray')
    ax[i,2].axis('off')
    ax[i,2].set_title("Empty slide")

plt.tight_layout()

Simply count the number of each cell type

cell_table_keep = cell_table.loc[cell_table['fov'].isin([fov1,fov2])]
count_cells = cell_table_keep.groupby('fov')['cell_cluster'].value_counts().reset_index(name='count')
count_cells
fov cell_cluster count
0 fov1 CD4T 630
1 fov1 Cancer 361
2 fov1 CD8T 341
3 fov1 APC 282
4 fov1 Immune_Other 218
5 fov1 M2_Mac 181
6 fov1 Fibroblast 148
7 fov1 Stroma 104
8 fov1 Cancer_EMT 82
9 fov1 Monocyte 80
10 fov1 B 78
11 fov1 Treg 67
12 fov1 Endothelium 50
13 fov1 Cancer_Other 40
14 fov1 M1_Mac 34
15 fov1 T_Other 32
16 fov1 NK 23
17 fov1 Mac_Other 16
18 fov1 Mast 10
19 fov1 Other 4
20 fov1 Neutrophil 1
21 fov2 CD4T 2260
22 fov2 B 1244
23 fov2 Cancer 636
24 fov2 Treg 492
25 fov2 CD8T 458
26 fov2 Immune_Other 396
27 fov2 Endothelium 284
28 fov2 Monocyte 239
29 fov2 APC 229
30 fov2 Fibroblast 208
31 fov2 Cancer_Other 139
32 fov2 M1_Mac 134
33 fov2 Stroma 133
34 fov2 T_Other 132
35 fov2 Other 58
36 fov2 Cancer_EMT 51
37 fov2 M2_Mac 46
38 fov2 Mac_Other 17
39 fov2 Mast 10
40 fov2 NK 6

Calculate frequency (divided by total number of cells)

# Get total number of cells per FOV
total_counts = cell_table_keep.groupby('fov').size().to_frame('total_cells')
total_counts = total_counts.reset_index()

# Calculate frequency
count_cells = count_cells.merge(total_counts, on='fov')
count_cells['frequency'] = count_cells['count'] / count_cells['total_cells']
count_cells
fov cell_cluster count total_cells frequency
0 fov1 CD4T 630 2782 0.226456
1 fov1 Cancer 361 2782 0.129763
2 fov1 CD8T 341 2782 0.122574
3 fov1 APC 282 2782 0.101366
4 fov1 Immune_Other 218 2782 0.078361
5 fov1 M2_Mac 181 2782 0.065061
6 fov1 Fibroblast 148 2782 0.053199
7 fov1 Stroma 104 2782 0.037383
8 fov1 Cancer_EMT 82 2782 0.029475
9 fov1 Monocyte 80 2782 0.028756
10 fov1 B 78 2782 0.028037
11 fov1 Treg 67 2782 0.024083
12 fov1 Endothelium 50 2782 0.017973
13 fov1 Cancer_Other 40 2782 0.014378
14 fov1 M1_Mac 34 2782 0.012221
15 fov1 T_Other 32 2782 0.011503
16 fov1 NK 23 2782 0.008267
17 fov1 Mac_Other 16 2782 0.005751
18 fov1 Mast 10 2782 0.003595
19 fov1 Other 4 2782 0.001438
20 fov1 Neutrophil 1 2782 0.000359
21 fov2 CD4T 2260 7172 0.315114
22 fov2 B 1244 7172 0.173452
23 fov2 Cancer 636 7172 0.088678
24 fov2 Treg 492 7172 0.068600
25 fov2 CD8T 458 7172 0.063859
26 fov2 Immune_Other 396 7172 0.055215
27 fov2 Endothelium 284 7172 0.039598
28 fov2 Monocyte 239 7172 0.033324
29 fov2 APC 229 7172 0.031930
30 fov2 Fibroblast 208 7172 0.029002
31 fov2 Cancer_Other 139 7172 0.019381
32 fov2 M1_Mac 134 7172 0.018684
33 fov2 Stroma 133 7172 0.018544
34 fov2 T_Other 132 7172 0.018405
35 fov2 Other 58 7172 0.008087
36 fov2 Cancer_EMT 51 7172 0.007111
37 fov2 M2_Mac 46 7172 0.006414
38 fov2 Mac_Other 17 7172 0.002370
39 fov2 Mast 10 7172 0.001394
40 fov2 NK 6 7172 0.000837

Calculate density (divided by tissue area)

# Calculate tissue area (inverse of empty slide)
def get_tissue_area(fov_name):
    empty_slide_path = os.path.join(data_dir,fov_name,"masks","empty_slide.tiff")
    empty_slide_array = np.array(io.imread(empty_slide_path))
    # Get total pixels that belong to empty slide
    empty_slide_px = np.sum(empty_slide_array) #pixel value is 1 if it is empty slide
    # Get number of pixels that belong to tissue (total image - empty slide)
    total_size = empty_slide_array.shape
    tissue_area_px = (total_size[0]*total_size[1]) - empty_slide_px
    # These images are 2048px x 2048px, imaged at 800um x 800um
    # Therefore, the size of 1 pixel is 800um/2048px = 0.39 um/px, area is 0.15 um^2
    return (fov_name, tissue_area_px*0.15)

# Calculate for all FOVs
all_tissue_px = [get_tissue_area(fov) for fov in [fov1,fov2]]
all_tissue_df = pd.DataFrame(all_tissue_px, columns=['fov','area'])

# Calculate density
count_cells = count_cells.merge(all_tissue_df, on='fov')
count_cells['density'] = count_cells['count'] / count_cells['area']
count_cells
fov cell_cluster count total_cells frequency area density
0 fov1 CD4T 630 2782 0.226456 596102.55 0.001057
1 fov1 Cancer 361 2782 0.129763 596102.55 0.000606
2 fov1 CD8T 341 2782 0.122574 596102.55 0.000572
3 fov1 APC 282 2782 0.101366 596102.55 0.000473
4 fov1 Immune_Other 218 2782 0.078361 596102.55 0.000366
5 fov1 M2_Mac 181 2782 0.065061 596102.55 0.000304
6 fov1 Fibroblast 148 2782 0.053199 596102.55 0.000248
7 fov1 Stroma 104 2782 0.037383 596102.55 0.000174
8 fov1 Cancer_EMT 82 2782 0.029475 596102.55 0.000138
9 fov1 Monocyte 80 2782 0.028756 596102.55 0.000134
10 fov1 B 78 2782 0.028037 596102.55 0.000131
11 fov1 Treg 67 2782 0.024083 596102.55 0.000112
12 fov1 Endothelium 50 2782 0.017973 596102.55 0.000084
13 fov1 Cancer_Other 40 2782 0.014378 596102.55 0.000067
14 fov1 M1_Mac 34 2782 0.012221 596102.55 0.000057
15 fov1 T_Other 32 2782 0.011503 596102.55 0.000054
16 fov1 NK 23 2782 0.008267 596102.55 0.000039
17 fov1 Mac_Other 16 2782 0.005751 596102.55 0.000027
18 fov1 Mast 10 2782 0.003595 596102.55 0.000017
19 fov1 Other 4 2782 0.001438 596102.55 0.000007
20 fov1 Neutrophil 1 2782 0.000359 596102.55 0.000002
21 fov2 CD4T 2260 7172 0.315114 624968.25 0.003616
22 fov2 B 1244 7172 0.173452 624968.25 0.001991
23 fov2 Cancer 636 7172 0.088678 624968.25 0.001018
24 fov2 Treg 492 7172 0.068600 624968.25 0.000787
25 fov2 CD8T 458 7172 0.063859 624968.25 0.000733
26 fov2 Immune_Other 396 7172 0.055215 624968.25 0.000634
27 fov2 Endothelium 284 7172 0.039598 624968.25 0.000454
28 fov2 Monocyte 239 7172 0.033324 624968.25 0.000382
29 fov2 APC 229 7172 0.031930 624968.25 0.000366
30 fov2 Fibroblast 208 7172 0.029002 624968.25 0.000333
31 fov2 Cancer_Other 139 7172 0.019381 624968.25 0.000222
32 fov2 M1_Mac 134 7172 0.018684 624968.25 0.000214
33 fov2 Stroma 133 7172 0.018544 624968.25 0.000213
34 fov2 T_Other 132 7172 0.018405 624968.25 0.000211
35 fov2 Other 58 7172 0.008087 624968.25 0.000093
36 fov2 Cancer_EMT 51 7172 0.007111 624968.25 0.000082
37 fov2 M2_Mac 46 7172 0.006414 624968.25 0.000074
38 fov2 Mac_Other 17 7172 0.002370 624968.25 0.000027
39 fov2 Mast 10 7172 0.001394 624968.25 0.000016
40 fov2 NK 6 7172 0.000837 624968.25 0.000010

Compare

cell_type = 'CD4T'
print(cell_type)
print('FOV:', fov1,
      ', total_counts:', count_cells.loc[(count_cells['fov']==fov1) & (count_cells['cell_cluster']==cell_type)]['count'].values[0],
      ', frequency:', count_cells.loc[(count_cells['fov']==fov1) & (count_cells['cell_cluster']==cell_type)]['frequency'].values[0],
      ', density:', count_cells.loc[(count_cells['fov']==fov1) & (count_cells['cell_cluster']==cell_type)]['density'].values[0])
print('FOV:', fov2,
      ', total_counts:', count_cells.loc[(count_cells['fov']==fov2) & (count_cells['cell_cluster']==cell_type)]['count'].values[0],
      ', frequency:', count_cells.loc[(count_cells['fov']==fov2) & (count_cells['cell_cluster']==cell_type)]['frequency'].values[0],
      ', density:', count_cells.loc[(count_cells['fov']==fov2) & (count_cells['cell_cluster']==cell_type)]['density'].values[0])
CD4T
FOV: fov1 , total_counts: 630 , frequency: 0.22645578720345075 , density: 0.0010568651316791046
FOV: fov2 , total_counts: 2260 , frequency: 0.3151143335192415 , density: 0.0036161837021320684

Additional exercises

  1. Change the “cell_type” in the code block above and evaluate how the quantification method is different for different cell types.
  2. Create a stacked bar plot showing the cell type composition of each fov (x axis should be fov1 and fov2, y axis is cell type composition). Try making this plot with counts, frequency, and density.

3. Cell-cell enrichment (global)

The goal of pairwise enrichment analysis is to assess whether any two given cell populations colocalize with each other. For example, this approach could be used to determine if T cells preferentially colocalize with tumor cells. In asking such questions, we recommend taking this one step further: Do two cell populations colocalize with each other more often than would be expected by chance? Depending on the frequency of the cell populations and native tissue structure, it may not be possible to determine whether some pairwise relationships are truly preferential or random. The goal of this approach is to minimize potential confounding effects specific to each image that might not be related to the biological question of interest. For example, in a tissue section equally composed of only two cell populations, pairwise enrichment is likely to occur simply by random chance.

With this in mind, we can assess the statistical significance of pairwise enrichment by comparing how often two cell populations are found in close proximity to each other compared with a null distribution. To construct a null distribution, we use a bootstrapping approach by selectively randomizing the location of the cell populations of interest, calculating pairwise enrichment, and repeating this process a large number of times (>100 times is typical). In the assessment of preferential enrichment for two cell populations defined by their expression of marker A or B, the simplest way to generate a null distribution is to randomize the location of one cell population across all cells in the image while keeping the location of the other cell population fixed. For both the original and randomized images, cells positive for A or B are colocalized if the distance between them is less than or equal to a user-defined value. The number of A–B interactions in the original image is then compared with the null distribution to determine statistical significance.

# Example image
ex_fov = "fov1"

# Determine cell types to look at
pheno1 = "Cancer"
pheno2 = "CD8T"

# Threshold to determine if two cells are "close"
dist_thresh = 50

# Number of bootstraps for generating null distribution
bootstrap_n = 100

Calculate the distance between all cells in the FOV

# Subset cell table for only cells in this FOV
fov_cell_table = cell_table.loc[cell_table['fov'] == ex_fov].reset_index(drop=True)
# Make list of all cell centroids
all_centroids = list(zip(fov_cell_table['centroid-0'],fov_cell_table['centroid-1']))
# Get distance between all cells
dist_mat = cdist(all_centroids, all_centroids, 'euclidean')
# Print dimensions of distance matrix
print("Dimensions of dist_mat: ", dist_mat.shape)

dist_mat
Dimensions of dist_mat:  (2782, 2782)
array([[   0.        ,  737.00067843,  645.00077519, ..., 2121.56381002,
        2168.22623358, 2341.59774513],
       [ 737.00067843,    0.        ,   92.        , ..., 2048.6554127 ,
        2512.93394263, 2776.93536115],
       [ 645.00077519,   92.        ,    0.        , ..., 2043.30443155,
        2460.39773208, 2715.39352581],
       ...,
       [2121.56381002, 2048.6554127 , 2043.30443155, ...,    0.        ,
        1301.00038432, 1718.00029104],
       [2168.22623358, 2512.93394263, 2460.39773208, ..., 1301.00038432,
           0.        ,  417.        ],
       [2341.59774513, 2776.93536115, 2715.39352581, ..., 1718.00029104,
         417.        ,    0.        ]])

This distance matrix has dimensions of total number of cells x total number of cells. The values in the matrix indicate the distance between any 2 cells. The indices of the distance matrix correspond to the indices of the cell table. So the first row in the distance matrix corresponds to the distance between cell label 1 and all other cells in the image. After calculating the distance between all cells, we can subset it for our cell type of choice. We can then count the number of close contacts between two cell types.

Count number of close contacts between pheno1 and pheno2

# Get index of cells belonging to pheno1 and pheno2
pheno1_idx = fov_cell_table[fov_cell_table['cell_cluster'] == pheno1].index.to_list()
pheno2_idx = fov_cell_table[fov_cell_table['cell_cluster'] == pheno2].index.to_list()

# Only keep pheno1 cells in x-axis of distance matrix
pheno1_dist_mat = dist_mat[pheno1_idx,:]
# Binarize the distance matrix for distances that are within the defined threshold
bin_mask = (pheno1_dist_mat < dist_thresh) & (pheno1_dist_mat > 0)
# Change true/false to 1/0
pheno1_dist_mat_bin = bin_mask*1

# Subset this distance matrix for pheno2 cells in y-axis of distance matrix
true_dist_mat_bin = pheno1_dist_mat_bin[:,pheno2_idx]
# Inspect the shape of this matrix, should be number of cells of pheno1 x number of cells of pheno2
# Each element in the matrix is the distance between a pheno1 cell and a pheno2 cell
print("Shape of subsetted distance matrix: ", true_dist_mat_bin.shape)

# For each pheno1 cell, count number of "close" contacts with pheno2 cells
true_close_contacts = np.sum(true_dist_mat_bin, axis=1)
# Take the average across all pheno1 cells
true_close_contacts_mean = np.mean(true_close_contacts)
print("Average number of close contacts between pheno1 and pheno2 cells: ", true_close_contacts_mean)
Shape of subsetted distance matrix:  (361, 341)
Average number of close contacts between pheno1 and pheno2 cells:  0.5207756232686981

Above, we have determined the average number of close contacts between pheno1 and pheno2 cells in this image. To determine whether this is a significant number, we can compare it against a null distribution, generated by randomly permuting cell labels. We keep the pheno1 cells constant, then randomize the location of pheno2 cells in the image. For each randomization, we calculate the number of close contacts. We repeat this many times to generate the null distribution.

Generate null distribution by bootstrapping

# Get all possible cell indices (total pool of available cells to randomize)
all_idx = fov_cell_table.index.to_list()
# Remove cells that are of pheno1 from this pool (since they are held constant in this randomization)
all_idx = [x for x in all_idx if x not in pheno1_idx]
# Get total number of cells that are pheno2
num_pheno2 = len(pheno2_idx)

# Randomly sample all cells to be labeled as pheno2 (bootstrapping)
all_bootstrap = []
for _ in range(bootstrap_n):
    # Select num_pheno2 random numbers, represents the indices of the randomly selected cells
    random_pheno2_idx = random.sample(all_idx, num_pheno2)
    # Subset the distance matrix to only keep these randomly selected cells
    keep_dist_mat_bin = pheno1_dist_mat_bin[:,random_pheno2_idx]
    # Find the total number of close contacts between pheno1 cells and randomly selected cells
    close_contacts = np.sum(keep_dist_mat_bin, axis=1)
    # Take the mean across all cells of pheno1
    close_contacts_mean = np.mean(close_contacts)
    # Add this value to the list of all bootstraps
    all_bootstrap.append(close_contacts_mean)

Compare null distrbution to actual number of close contacts

fig, ax = plt.subplots(figsize=(5,3))
# Blue histogram is null distribution
ax.hist(all_bootstrap, density=True,  bins=10, alpha=0.5)
# Red line is actual number of close contacts
plt.axvline(x=true_close_contacts_mean, color='red', linestyle='--', linewidth=2)
plt.show()
# Calculate statistics of null distribution
muhat, sigmahat = stats.norm.fit(all_bootstrap)
# Calculate z score based on distribution
z = (true_close_contacts_mean - muhat) / sigmahat
print("z-score: ", z)
z-score:  1.7648982094060992

Additional exercises

  1. Flip “pheno1” and “pheno2” - is this calculation symmetric? Conceptually, should it be symmetric?
  2. Play around with the distance threshold for determing “close” cells.
  3. Play around with “pheno1” and “pheno2” and evaluate these metrics for various pairs of cells.
  4. Pick a few different pheno1-pheno2 pairs. For one FOV, loop through this pairwise enrichment function for these pairs of cells. Store this output in a table (hint: column names could be “pheno1”, “pheno2”, “z-score”).
  5. Repeat this calculation for the second FOV.
  6. Compare z-scores between fov1 and fov2.

4. Cell-cell enrichment (context dependent)

This randomization strategy is robust for mitigating confounding effects attributable to the frequency of each cell population. However, this approach does not control for biases inherent to tissue structure. Consequently, interactions between two proteins may appear to be spatially enriched as a result of tissue structure or cell-specific expression, as markers expressed exclusively by certain cells will be heavily influenced by the location of those cells. For example, two immune cell populations that are known to preferentially localize in germinal centers are biased to be enriched with each other. While one possibility is that this relationship is indeed preferential, another possibility is that spatial enrichment is merely a result of both cell types being restricted to a smaller histological compartment.

To address this possibility, null distributions can also be generated in a context-dependent manner. In context-dependent spatial enrichment analysis, randomizations are restricted to occur only within a given cellular compartment or cell subset. In the germnial center example, randomizations can be restricted to occur only within the germinal center. As a result, spurious enrichments that might occur between noninteracting cell populations that happen to colocalize to germinal centers are accounted for in the null distribution.

For this dataset, we have generated masks for the cancer and stroma, core and border regions (using composite channels representative for each region).

Look at cancer and stroma masks

cancer_core = io.imread(os.path.join(data_dir, ex_fov, "masks", "cancer_core.tiff"))
cancer_border = io.imread(os.path.join(data_dir, ex_fov, "masks", "cancer_border.tiff"))
stroma_core = io.imread(os.path.join(data_dir, ex_fov, "masks", "stroma_core.tiff"))
stroma_border = io.imread(os.path.join(data_dir, ex_fov, "masks", "stroma_border.tiff"))

fig, ax = plt.subplots(1,4,figsize=[12,4])
ax[0].imshow(cancer_core, cmap='gray')
ax[0].axis('off')
ax[0].set_title("Cancer core")
ax[1].imshow(cancer_border, cmap='gray')
ax[1].axis('off')
ax[1].set_title("Cancer border")
ax[2].imshow(stroma_core, cmap='gray')
ax[2].axis('off')
ax[2].set_title("Stroma core")
ax[3].imshow(stroma_border, cmap='gray')
ax[3].axis('off')
ax[3].set_title("Stroma border")
plt.tight_layout()
plt.show()

Repeat calculation for number of close contacts but subset for cells within the cancer core only

# Get segmentation mask
seg_path = os.path.join(data_dir, ex_fov, "masks", "segmentation_whole_cell.tiff")
seg_array = np.array(io.imread(seg_path)).squeeze()

# Make new segmentation mask where everything outside of the cancer core is 0
masked_seg_array = np.copy(seg_array)
masked_seg_array[cancer_core == 0] = 0
# Only keep cell labels that are within the cancer core
masked_cell_labels = np.unique(masked_seg_array)
masked_cell_labels = [x for x in masked_cell_labels if x != 0] #remove 0 (which is empty slide)
# Subset cell table for only the cells within the cancer core
fov_cell_table_masked = fov_cell_table[fov_cell_table['label'].isin(masked_cell_labels)]
# Get index of cells belonging to pheno1 and pheno2, but this time, only keep labels that are in the cancer core mask
pheno1_idx = fov_cell_table_masked[fov_cell_table_masked['cell_cluster'] == pheno1].index.to_list()
pheno2_idx = fov_cell_table_masked[fov_cell_table_masked['cell_cluster'] == pheno2].index.to_list()

# Only keep pheno1 cells in x-axis of distance matrix
pheno1_dist_mat = dist_mat[pheno1_idx,:]
# Binarize the distance matrix for distances that are within the defined threshold
bin_mask = (pheno1_dist_mat < dist_thresh) & (pheno1_dist_mat > 0)
# Change true/false to 1/0
pheno1_dist_mat_bin = bin_mask*1

# Subset this distance matrix for pheno2 cells in y-axis of distance matrix
true_dist_mat_bin = pheno1_dist_mat_bin[:,pheno2_idx]

# For each pheno1 cell, count number of "close" contacts with pheno2 cells
true_close_contacts = np.sum(true_dist_mat_bin, axis=1)
true_close_contacts_mean = np.mean(true_close_contacts)
print("Average number of close contacts between pheno1 and pheno2 cells: ", true_close_contacts_mean)
Average number of close contacts between pheno1 and pheno2 cells:  0.0989010989010989

Generate null distribution, only randomizing within cancer core

# Get all possible cell indices (total pool of available cells to randomize)
all_idx = fov_cell_table_masked.index.to_list()
# Remove cells that are of pheno1 from this pool (since they are held constant in this randomization)
all_idx = [x for x in all_idx if x not in pheno1_idx]
# Get total number of cells that are pheno2
num_pheno2 = len(pheno2_idx)

# Randomly sample all cells to be labeled as pheno2 (bootstrapping)
all_bootstrap = []
for _ in range(bootstrap_n):
    random_pheno2_idx = random.sample(all_idx, num_pheno2)
    keep_dist_mat_bin = pheno1_dist_mat_bin[:,random_pheno2_idx]
    close_contacts = np.sum(keep_dist_mat_bin, axis=1)
    close_contacts_mean = np.mean(close_contacts)
    all_bootstrap.append(close_contacts_mean)

Compare null distrbution to actual number of close contacts

fig, ax = plt.subplots(figsize=(5,3))
# Blue histogram is null distribution
ax.hist(all_bootstrap, density=True,  bins=10, alpha=0.5)
# Red line is actual number of close contacts
plt.axvline(x=true_close_contacts_mean, color='red', linestyle='--', linewidth=2)
plt.show()
# Calculate statistics of null distribution
muhat, sigmahat = stats.norm.fit(all_bootstrap)
# Calculate z score based on distribution
z = (true_close_contacts_mean - muhat) / sigmahat
print("z-score: ", z)
z-score:  0.23586188717669343

Additional exercises

  1. Play with different regions and phenotype pairs.

5. Cellular microenvironments

Cell phenotypes are regulated in part by their local cellular niche, or microenvironment, which is defined as a collection of cell lineages found to be spatially colocalized and conserved across the data set. We can employ several methods to identify these microenvironments. The first step in this process is to quantify the local distribution of cell lineages around each cell in the data set. Here, we are enumerating the number of cells in each lineage within a set pixel radius from the index cell. Various machine learning methods can then be used to identify distinct cellular microenvironments that occur repeatedly across the data set. Here, we are using k-means clustering.

# FOVs to include in this analysis
all_fovs = ["fov1", "fov2"]

# Set distance for the neighborhood around the cell
nh_dist = 50

# Set the number of neighborhoods
k = 3

# Set a seed for reproducibility
seed = 2024

# Get list of all cell types in our data
all_cell_types = cell_table['cell_cluster'].unique()

Find neighbors for each cell

# Define function to count the neighbors of each cell (where each cell is one row in the distance matrix)
def one_cell_nh(row):
    # Get index of neighbors (defined as "close" in the distance matrix)
    neighbor_idx = np.where(row==1)[0]
    # Initialize output as a dictionary, start with count of 0 for every cell type
    nh = {key:0 for key in all_cell_types}
    # Convert these indices to cell types
    if len(neighbor_idx) > 0:
        nh_cells = fov_cell_table.loc[neighbor_idx]['cell_cluster'].values
        # Count the number of each cell type
        unique_cell_type, counts = np.unique(nh_cells, return_counts=True)
        # Store the data in the dictionary
        for i,one_cell_type in enumerate(unique_cell_type):
            nh[one_cell_type] = counts[i]
    return nh

# Iterate through all fovs to get neighbors of all cells
all_nh_list = [] #store information for each fov in a list
for fov in all_fovs:
    # Subset cell table for only cells in this FOV
    fov_cell_table = cell_table.loc[cell_table['fov'] == fov].reset_index(drop=True)
    # Make list of all cell centroids
    all_centroids = list(zip(fov_cell_table['centroid-0'],fov_cell_table['centroid-1']))
    # Get distance between all cells
    dist_mat = cdist(all_centroids, all_centroids, 'euclidean')

    # Binarize distance matrix (1 if two cells are "close")
    dist_mat_bin = dist_mat < nh_dist
    dist_mat_bin = dist_mat_bin*1

    # Remove itself as its own neighbor
    dist_mat_bin[dist_mat == 0] = 0
    
    # Apply function to every cell in the data
    nh_dicts = np.apply_along_axis(one_cell_nh, axis=1, arr=dist_mat_bin)
    # Turn into dataframe
    nh_df = pd.DataFrame(nh_dicts.tolist())
    # Get total number of neighbors
    nh_df['total_neighbors'] = nh_df.sum(axis=1)
    # Combine with data from cell table
    nh_df = pd.merge(fov_cell_table, nh_df, left_index=True, right_index=True)
    all_nh_list.append(nh_df)

# Turn data into dataframe
all_nh_df = pd.concat(all_nh_list)
all_nh_df
fov label centroid-0 centroid-1 cell_cluster Cancer_Other Immune_Other CD4T NK Stroma ... T_Other Treg Other Endothelium B Neutrophil Mac_Other Cancer_EMT M1_Mac total_neighbors
0 fov1 1 2 759 Cancer_Other 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 fov1 2 3 22 Immune_Other 1 0 1 0 1 ... 0 0 0 0 0 0 0 0 0 5
2 fov1 3 3 114 CD4T 0 1 2 0 0 ... 0 0 0 0 0 0 0 0 0 5
3 fov1 4 4 436 CD4T 0 2 5 0 0 ... 0 0 0 0 0 0 0 0 0 7
4 fov1 5 4 164 NK 0 0 2 0 0 ... 0 0 0 0 0 0 0 0 0 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7167 fov2 7168 2044 1930 CD4T 0 1 0 0 0 ... 0 1 0 0 0 0 0 0 0 3
7168 fov2 7169 2045 250 Immune_Other 0 2 0 0 0 ... 0 1 2 0 1 0 0 0 0 9
7169 fov2 7170 2044 1128 CD4T 0 1 3 0 1 ... 0 0 2 0 0 0 0 0 0 7
7170 fov2 7171 2044 1231 M1_Mac 0 0 5 0 0 ... 0 0 0 0 5 0 0 0 1 11
7171 fov2 7172 2043 1464 CD8T 0 0 5 0 0 ... 0 3 0 0 1 0 0 0 0 10

9954 rows × 27 columns

k-means clustering

When finding the neighborhood, you can either use the number of cells in the neighborhood of each cell, or the frequency (divided by the total number of cells). The interpretation of each of these is slightly different. We encourage you to try both ways (you can comment/uncomment the line of code that calculates frequency below). By clustering on the neighborhoods, we can identify distinct microenvironments.

# If you want to try clustering using frequency (meaning dividing by the total number of cells, uncomment this line
# all_nh_df[all_cell_types] = all_nh_df[all_cell_types].divide(all_nh_df['total_neighbors'], axis=0)

# Remove cells that have no neighbors
all_nh_df_zeros_removed = all_nh_df.loc[all_nh_df['total_neighbors'] > 0].copy()

# Only keep columns we want for clustering
kmeans_input = all_nh_df_zeros_removed[all_cell_types]

# Cluster
cluster_fit = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(kmeans_input)
# Add 1 to labels to avoid cluster number 0 (because output of kmeans is 0-indexed)
cluster_labels = ["ME"+str(x) for x in cluster_fit.labels_+1]

# Add cluster labels to cell table
all_nh_df_zeros_removed['me_cluster'] = cluster_labels

# Merge with the big cell table, assign unassigned cells (becaue no neighbors) to k+1
all_nh_df = all_nh_df.merge(all_nh_df_zeros_removed, how="left")
all_nh_df.me_cluster.fillna("Unassigned", inplace=True)
all_nh_df
/tmp/ipykernel_352/1057615328.py:20: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  all_nh_df.me_cluster.fillna("Unassigned", inplace=True)
fov label centroid-0 centroid-1 cell_cluster Cancer_Other Immune_Other CD4T NK Stroma ... Treg Other Endothelium B Neutrophil Mac_Other Cancer_EMT M1_Mac total_neighbors me_cluster
0 fov1 1 2 759 Cancer_Other 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 Unassigned
1 fov1 2 3 22 Immune_Other 1 0 1 0 1 ... 0 0 0 0 0 0 0 0 5 ME3
2 fov1 3 3 114 CD4T 0 1 2 0 0 ... 0 0 0 0 0 0 0 0 5 ME3
3 fov1 4 4 436 CD4T 0 2 5 0 0 ... 0 0 0 0 0 0 0 0 7 ME3
4 fov1 5 4 164 NK 0 0 2 0 0 ... 0 0 0 0 0 0 0 0 5 ME3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9949 fov2 7168 2044 1930 CD4T 0 1 0 0 0 ... 1 0 0 0 0 0 0 0 3 ME3
9950 fov2 7169 2045 250 Immune_Other 0 2 0 0 0 ... 1 2 0 1 0 0 0 0 9 ME3
9951 fov2 7170 2044 1128 CD4T 0 1 3 0 1 ... 0 2 0 0 0 0 0 0 7 ME3
9952 fov2 7171 2044 1231 M1_Mac 0 0 5 0 0 ... 0 0 0 5 0 0 0 1 11 ME1
9953 fov2 7172 2043 1464 CD8T 0 0 5 0 0 ... 3 0 0 1 0 0 0 0 10 ME3

9954 rows × 28 columns

Visualize microenvironments

We can visualize the number of each cell type that is assigned to each microenvironment.

# Count the number of each cell type assigned to each microenvironment
num_cell_types_dat_long = all_nh_df_zeros_removed.groupby(['me_cluster', 'cell_cluster']).size().reset_index(name='counts')
# Reformat this table
num_cell_types_dat = num_cell_types_dat_long.pivot(index='me_cluster', columns='cell_cluster', values='counts')
num_cell_types_dat.fillna(0, inplace=True)

# Make heatmap
heatmap = sns.clustermap(
    num_cell_types_dat.apply(stats.zscore, axis=1), # Apply a z-score for better visualization
    cmap = "vlag",
    center = 0,
    vmin = -3,
    vmax = 3
)

Visualize cell phenotypes maps, colored by microenvironment

# Define colors we want for each microenvironment (add colors here if you have larger k, you need k+1 colors)
all_colors = {}
all_colors['ME1'] = '#4E79A7'
all_colors['ME2'] = '#59A14F'
all_colors['ME3'] = '#D37295'
all_colors['Unassigned'] = '#79706E'

# Create table matching each color to a unique ID
colors_list = [(key,value) for key,value in all_colors.items()]
all_colors_df = pd.DataFrame(colors_list, columns=['me_cluster','color'])
all_colors_df['pheno_id'] = all_colors_df.index + 1

# Make color map for plotting
mycols = all_colors_df['color'].tolist()
mycols.insert(0,'#000000') # add black for empty slide, will have id 0
mycols.append('#FFFFFF') # add white for cell borders, will have id max_n+1
colmap = colors.ListedColormap(mycols)
max_n = np.max(all_colors_df['pheno_id'])
bounds = [i-0.5 for i in np.linspace(0,max_n+2, max_n+3)]
norm = colors.BoundaryNorm(bounds, colmap.N)
# Create microenvironment masks
for fov in all_fovs:
    # Get segentation array
    seg_path = os.path.join(data_dir, fov, "masks", "segmentation_whole_cell.tiff")
    seg_array = io.imread(seg_path).squeeze()
    
    # Make CPM using function we created above
    fig, ax = plt.subplots(figsize=[8,8])
    cpm = create_cpm(fov, all_nh_df, all_colors_df, seg_array, fig, ax)
    plt.axis('off')
    ax.set_title(fov)

    # Add colorbar
    divider = make_axes_locatable(fig.gca())
    cax = divider.append_axes(position="right", size="5%", pad="3%")
    cbar = fig.colorbar(cm.ScalarMappable(norm=norm, cmap=colmap),
                        cax=cax, orientation="vertical", use_gridspec=True, pad=0.1,
                        shrink=0.9, drawedges=True)
    cbar_labels = all_colors_df['me_cluster'].to_list()
    cbar_labels.insert(0,'Empty') # add black for empty slide, will have id 0
    cbar_labels.append('Cell border') # add white for cell borders, will have id max_n+1

    cbar.ax.set_yticks(
        ticks=np.arange(len(cbar_labels)),
        labels=cbar_labels
    )
    cbar.minorticks_off()

    plt.tight_layout()

Additional exercises

  1. Compare generating neighborhoods using counts vs. frequency.
  2. Change the k used for k-means clustering. How do the results change?
  3. Inertia and silhouette score are two metrics that could be useful for helping us choose the best k. Try different k’s and extracting the inertia value (see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html, intertia_ is an attribute of the output of k-means). Plot k on the x axis and inertia on the y axis. What trend do you see?